Choosing document structure weights

Author

  • Andrew Trotman
Abstract

Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere: an occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm, and the learned weights are tested on an evaluation set of queries. Structure-weighted vector space inner product and structure-weighted probabilistic retrieval show an improvement of about 5% in mean average precision over their unstructured counterparts. Structure-weighted BM25 shows almost no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
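As a rough illustration of the idea of structure weighting, the sketch below scales each term occurrence by a weight attached to the structure it appears in and then sums the weighted counts for the query terms. It is a simplified inner product with no idf or length normalisation, not the formulation used in the paper. The names `weighted_tf` and `inner_product_score`, the structure labels `abstract` and `body`, and the weight values are all assumptions chosen for the example; in the paper the weights are learned with a genetic algorithm rather than set by hand.

```python
from collections import Counter


def weighted_tf(doc, weights):
    """Return term counts where each occurrence counts as its structure's weight.

    doc maps a structure name to the list of tokens found in that structure;
    weights maps a structure name to a float (hypothetical values, default 1.0).
    """
    tf = Counter()
    for structure, tokens in doc.items():
        w = weights.get(structure, 1.0)
        for token in tokens:
            tf[token] += w
    return tf


def inner_product_score(query_terms, doc, weights):
    """Simplified inner product: sum of weighted term counts over query terms."""
    tf = weighted_tf(doc, weights)
    return sum(tf[term] for term in query_terms)


# Toy document with two structures (illustrative only).
doc = {
    "abstract": ["structure", "weights"],
    "body": ["structure", "ranking", "ranking"],
}
uniform = {"abstract": 1.0, "body": 1.0}   # unstructured baseline
boosted = {"abstract": 2.0, "body": 1.0}   # abstract occurrences count double

baseline_score = inner_product_score(["structure"], doc, uniform)
boosted_score = inner_product_score(["structure"], doc, boosted)
```

With uniform weights the score reduces to the plain term count, so the unstructured ranking is the special case in which every structure has weight 1.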


Similar resources

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...


A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure

Author profiling is a text classification technique used to predict the profiles of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country and educational background. The existing approaches to author profiling suffer from problems like high dimensionality of features and fail to capture th...


Choosing weights for a complete ranking of DMUs in DEA and cross-evaluation

Conventional data envelopment analysis (DEA) assists decision makers in distinguishing between efficient and inefficient decision making units (DMUs) in a homogeneous group. However, DEA does not provide more information about the efficient DMUs. One interesting research subject is how to discriminate between efficient DMUs. The aim of this paper is to rank all efficient (extreme and non-ex...


Document Clustering Using Term Weights and Class Label Terms Based on Semantic Features

Class labels for clusters can be generated automatically, but they are of much lower quality than labels specified by humans. In this paper, we propose a new method for enhancing document clustering using class label terms and term weights. The class label terms can represent the inherent structure of document clusters well by non-negative matrix factorization (NMF). It can also improve the quality ...


Optimal Structure Weighted Retrieval

Improving ranking functions for structured information retrieval has received much attention since the inception of XML. Weighting document structures is one method providing significant improvement – but how good can these improvements be? Optimal structure weighted retrieval occurs when each query is processed using the optimal set of weights for that query. Optimal retrieval for a set of que...



Journal title:
  • Inf. Process. Manage.

Volume 41  Issue 

Pages  -

Publication date 2005